From archive to corpus: transcription and annotation in the creation of signed language corpora
Abstract
The essential characteristic of a signed language corpus is that it has been annotated, and not, contrary to the practice of many signed language researchers, that it has been transcribed. Annotations are necessary for corpus-based investigations of signed or spoken languages. Multi-media annotation software can now be used to transform a recording into a machine-readable text without it first being necessary to transcribe the text, provided that linguistic units are uniquely identified and annotations subsequently appended to these units. These unique identifiers are here referred to as ID-glosses. The use of ID-glosses is only possible if a reference lexical database (i.e., dictionary) exists as the result of prior foundation research into the lexicon. In short, the creators of signed language corpora should prioritize annotation above transcription, and ensure that signs are identified using unique gloss-based annotations. Without this the whole rationale for corpus-creation is undermined.
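The relationship between ID-glosses, the reference lexical database, and appended annotations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the field names, and the example glosses are hypothetical and are not taken from the paper or from any annotation tool's API. It shows only the general idea that each sign token carries a unique identifier that must exist in the lexicon, and that further annotations attach to that identified token.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexiconEntry:
    """One sign in the reference lexical database (hypothetical schema)."""
    id_gloss: str      # unique identifier for the sign
    keywords: tuple    # English keywords, for illustration only


@dataclass
class Token:
    """One time-aligned sign token in a recording (hypothetical schema)."""
    start_ms: int
    end_ms: int
    id_gloss: str
    annotations: dict = field(default_factory=dict)  # appended after identification


def validate(tokens, lexicon):
    """Return the tokens whose ID-gloss is missing from the lexicon."""
    return [t for t in tokens if t.id_gloss not in lexicon]


def frequencies(tokens):
    """Count tokens per ID-gloss; possible only because glosses are unique."""
    return Counter(t.id_gloss for t in tokens)


# Invented example data, not drawn from any real corpus.
lexicon = {
    entry.id_gloss: entry
    for entry in [
        LexiconEntry("HOUSE", ("house", "home")),
        LexiconEntry("LOOK", ("look", "watch")),
    ]
}

tokens = [
    Token(0, 400, "HOUSE"),
    Token(400, 900, "LOOK", {"grammatical_class": "verb"}),
    Token(900, 1300, "HOUSE"),
    Token(1300, 1700, "HOME"),  # a translation keyword, not an ID-gloss
]

unknown = validate(tokens, lexicon)
counts = frequencies(t for t in tokens if t.id_gloss in lexicon)
```

In this sketch the token glossed "HOME" is flagged because it uses a translation keyword rather than the lexicon's identifier. This is the kind of inconsistency that makes counts unreliable when signs are not uniquely identified.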
Similar resources
Creating a Corpus of Auslan within an Australian National Corpus
The creation of signed language (SL) corpora presents special challenges to linguists. They are face-to-face visual-gestural languages that have no widely accepted written forms or standard specialist notation system, making even superficial transcription problematic. SL corpora need to be created taking these facts into account. Using the example of Auslan (Australian Sign Language) this paper...
The Formosan Language Archive: Linguistic Analysis and Language Processing
In this paper, we deal with the linguistic analysis approach adopted in the Formosan Language Corpora, one of the three main information databases included in the Formosan Language Archive, and the language processing programs that have been built upon it. We first discuss problems related to the transcription of different language corpora. We then deal with annotation rules and standards. We g...
Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR
This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...
The Sign Linguistics Corpora Network: Towards Standards for Signed Language Resources
The Sign Linguistics Corpora Network is a three-year network initiative that aims to collect existing knowledge and practices on the creation and use of signed language resources. The concrete goals are to organise a series of four workshops in 2009 and 2010, create a stable Internet location for such knowledge, and generate new ideas for employing the most recent technologies for the study of ...
Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application
The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...
Publication date: 2008